ADM - Introduction to Classification models

Outline: Introduction to Classification models
¶

  • Introduction to Classification Models
  • k-Nearest Neighbour
  • Basic Evaluation
  • Underfitting-Overfitting
  • Cross-Validation
  • Logistic Regression
  • Naive Bayes
  • Decision Tree and Random Forest

No description has been provided for this image

Target (dependent) and predictor (independent) variables
¶

No description has been provided for this image

  • Target variable: one or more variables that are influenced by one or more other variables. Example: an employee's salary is influenced by length of service, rank, and position.
  • Predictor variable: one or more variables that influence one or more other variables. Example: speed influences travel time.
  • Control variable: a variable/element whose value is held constant, usually in an experiment, to test the relationship between the target and predictor variables. Example: the use of a placebo in an experiment on the effect of a certain drug.
  • Confounding variable: also called the "third variable" or "mediator variable", an extra variable that influences the relationship between the dependent and independent variables. Example: in a study of the effect of exercise (predictor) on body weight (target), other variables such as diet and age will also have an influence.

The Importance of Domain Knowledge
¶

Data Structure of a Classification Problem
¶

  • Classification is the problem of assigning a set of new observations to one of a set of pre-existing categories (classes).
  • Referring to the figure below, classification is used when the target variable is categorical and the predictors are one or more numeric and/or categorical variables.

Applications of Classification Models
¶

Various Approaches to Classification
¶

  • There are quite a few classification models to choose from, ranging from classic ones such as Linear Discriminant Analysis (LDA) and logistic regression, through intermediate ones such as SVM (support vector machines), decision trees, and neural networks, to more recent ones such as random forests and deep learning.
  • Each has its own strengths and weaknesses, depending on how the model/algorithm works.

Inductive Bias as an Essential Foundation for Understanding ALL Data Science and Machine Learning Models
¶

image source: https://sgfin.github.io/2020/06/22/Induction-Intro/

The Classification Problem
¶

  • Suppose we are given a problem with two categories, orange and purple, as in the figure.
  • Each point in the figure is a data entity consisting of several variables.
  • Given a new point (white), the classification problem is to assign this new data point to the orange or the purple category.
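
The idea can be sketched with a toy nearest-neighbour classifier; the coordinates and cluster positions below are made up purely for illustration (the actual models are discussed later in this module):

```python
# Toy sketch: assign a new ("white") point to the class of its nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # "orange" cluster
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])  # "purple" cluster
y = np.array(["orange"] * 3 + ["purple"] * 3)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
new_point = np.array([[2.9, 3.0]])   # the new point to be classified
print(clf.predict(new_point))        # -> ['purple']
```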

Let's Discuss the Theory Alongside Its Implementation¶

In [1]:
# !pip install graphviz dtreeviz # If running on Google Colab
In [2]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt
import time, numpy as np, seaborn as sns
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
#from dtreeviz.trees import *
#import graphviz
from sklearn import svm, preprocessing
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
sns.set(style="ticks", color_codes=True)
"Done"
Out[2]:
'Done'

Simple Classification Case 01: Classifying Iris Flower Species
¶

  • The Iris flower classification dataset as a simple case study
  • Data link: https://archive.ics.uci.edu/ml/datasets/iris
  • Source paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
  • The classification task is to classify the species of an Iris flower based on the shape (e.g. length and width) of the flower.

In [3]:
# Load the Iris flower dataset
data = sns.load_dataset("iris")
print(data.shape)
data.sample(5)
(150, 5)
Out[3]:
sepal_length sepal_width petal_length petal_width species
147 6.5 3.0 5.2 2.0 virginica
33 5.5 4.2 1.4 0.2 setosa
104 6.5 3.0 5.8 2.2 virginica
18 5.7 3.8 1.7 0.3 setosa
64 5.6 2.9 3.6 1.3 versicolor
In [4]:
data['species'] = data['species'].astype('category')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   sepal_length  150 non-null    float64 
 1   sepal_width   150 non-null    float64 
 2   petal_length  150 non-null    float64 
 3   petal_width   150 non-null    float64 
 4   species       150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB
None
In [5]:
print("Duplikasi = ", data.duplicated().sum())
print(data.isnull().sum())
Duplikasi =  1
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
In [6]:
data.drop_duplicates(keep="first", inplace=True)
print("Duplikasi = ", data.duplicated().sum())
Duplikasi =  0
In [7]:
p = sns.pairplot(data, hue="species")
No description has been provided for this image
In [8]:
# We create a new dataframe; be careful if the data is large.
df1 = data[['sepal_length','sepal_width','petal_length','petal_width']]
y1 = data['species']
df1.shape, y1.shape
Out[8]:
((149, 4), (149,))

Simple Classification Case 02: Building Energy Efficiency
¶

  • 12 different building shapes were simulated in EcoTect. The buildings differ in several parameters (e.g. glazing area, glazing area distribution, and orientation).
  • From these parameters there are 768 building configurations and 8 variables.
  • The goal is to predict two real-valued responses. The data can also be used as a multi-class classification problem if the responses are rounded to the nearest integer.
  • Data link: https://archive.ics.uci.edu/ml/datasets/energy+efficiency
  • Source paper: A. Tsanas, A. Xifara: "Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools", Energy and Buildings, Vol. 49, pp. 560-567, 2012

In [9]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"

try: # Running locally; make sure "file_" is in the "data" folder
    data = pd.read_csv(file_)
except: # Running in Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
    data = pd.read_csv(file_)
print(data.shape)
data.sample(5)
(768, 12)
Out[9]:
compactness surface-area wall-area roof-area overall-height orientation glazing-area glazing-dist heating-load cooling-load heat-cat cool-cat
220 0.71 710.5 269.5 220.5 3.5 2 0.10 4 10.66 13.67 10 13
545 0.79 637.0 343.0 147.0 7.0 3 0.40 1 42.62 39.07 42 39
568 0.64 784.0 343.0 220.5 3.5 2 0.40 1 19.52 22.72 19 22
359 0.76 661.5 416.5 122.5 7.0 5 0.25 2 36.45 36.81 36 36
421 0.66 759.5 318.5 220.5 3.5 3 0.25 3 13.01 15.80 13 15

PreProcessing & Minor EDA
¶

  • What preprocessing is needed?
In [10]:
print(data.info())
print(set(data["orientation"]))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   compactness     768 non-null    float64
 1   surface-area    768 non-null    float64
 2   wall-area       768 non-null    float64
 3   roof-area       768 non-null    float64
 4   overall-height  768 non-null    float64
 5   orientation     768 non-null    int64  
 6   glazing-area    768 non-null    float64
 7   glazing-dist    768 non-null    int64  
 8   heating-load    768 non-null    float64
 9   cooling-load    768 non-null    float64
 10  heat-cat        768 non-null    int64  
 11  cool-cat        768 non-null    int64  
dtypes: float64(8), int64(4)
memory usage: 72.1 KB
None
{2, 3, 4, 5}
In [11]:
data['orientation'] = data['orientation'].astype('category')
data['heat-cat'] = data['heat-cat'].astype('category')
data['cool-cat'] = data['cool-cat'].astype('category')
data.describe(include="all")
Out[11]:
compactness surface-area wall-area roof-area overall-height orientation glazing-area glazing-dist heating-load cooling-load heat-cat cool-cat
count 768.000000 768.000000 768.000000 768.000000 768.00000 768.0 768.000000 768.00000 768.000000 768.000000 768.0 768.0
unique NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN 37.0 39.0
top NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN 12.0 14.0
freq NaN NaN NaN NaN NaN 192.0 NaN NaN NaN NaN 84.0 82.0
mean 0.764167 671.708333 318.500000 176.604167 5.25000 NaN 0.234375 2.81250 22.307195 24.587760 NaN NaN
std 0.105777 88.086116 43.626481 45.165950 1.75114 NaN 0.133221 1.55096 10.090204 9.513306 NaN NaN
min 0.620000 514.500000 245.000000 110.250000 3.50000 NaN 0.000000 0.00000 6.010000 10.900000 NaN NaN
25% 0.682500 606.375000 294.000000 140.875000 3.50000 NaN 0.100000 1.75000 12.992500 15.620000 NaN NaN
50% 0.750000 673.750000 318.500000 183.750000 5.25000 NaN 0.250000 3.00000 18.950000 22.080000 NaN NaN
75% 0.830000 741.125000 343.000000 220.500000 7.00000 NaN 0.400000 4.00000 31.667500 33.132500 NaN NaN
max 0.980000 808.500000 416.500000 220.500000 7.00000 NaN 0.400000 5.00000 43.100000 48.030000 NaN NaN
In [12]:
print("Duplikasi = ", data.duplicated().sum())
print(data.isnull().sum())
Duplikasi =  0
compactness       0
surface-area      0
wall-area         0
roof-area         0
overall-height    0
orientation       0
glazing-area      0
glazing-dist      0
heating-load      0
cooling-load      0
heat-cat          0
cool-cat          0
dtype: int64
In [13]:
# Warning: somewhat slow because quite a few plots are generated
col_ = "surface-area wall-area roof-area overall-height heat-cat".split()
p = sns.pairplot(data[col_], hue="heat-cat")
No description has been provided for this image
In [14]:
# Challenge of the prediction
print(data["heat-cat"].value_counts())
p = sns.countplot(x="heat-cat", data=data)
12    84
14    67
32    56
11    45
15    45
10    43
28    43
29    38
36    31
24    31
16    27
13    26
35    20
17    17
39    16
40    16
26    16
25    16
19    13
33    12
31    12
6     12
23    12
18    10
41    10
38     9
42     8
27     5
22     5
37     5
7      4
34     4
8      4
30     2
20     2
21     1
43     1
Name: heat-cat, dtype: int64
No description has been provided for this image
In [15]:
data["orientation"].value_counts()
Out[15]:
2    192
3    192
4    192
5    192
Name: orientation, dtype: int64
In [16]:
# One-hot encoding, then concatenate with the original data
dum_ = pd.get_dummies(data['orientation'], prefix='ori')
data = pd.concat([data, dum_], axis = 1)
data.head()
Out[16]:
compactness surface-area wall-area roof-area overall-height orientation glazing-area glazing-dist heating-load cooling-load heat-cat cool-cat ori_2 ori_3 ori_4 ori_5
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33 15 21 1 0 0 0
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33 15 21 0 1 0 0
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33 15 21 0 0 1 0
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33 15 21 0 0 0 1
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28 20 28 1 0 0 0
In [17]:
df2A = data[['compactness', 'surface-area', 'wall-area', 'roof-area', \
            'overall-height','orientation','glazing-area','glazing-dist']]
df2B = data[['compactness', 'surface-area', 'wall-area', 'roof-area', \
            'overall-height','ori_2', 'ori_3', 'ori_4', 'ori_5','glazing-area','glazing-dist']]
y2 = data['heat-cat']
df2B.head()
Out[17]:
compactness surface-area wall-area roof-area overall-height ori_2 ori_3 ori_4 ori_5 glazing-area glazing-dist
0 0.98 514.5 294.0 110.25 7.0 1 0 0 0 0.0 0
1 0.98 514.5 294.0 110.25 7.0 0 1 0 0 0.0 0
2 0.98 514.5 294.0 110.25 7.0 0 0 1 0 0.0 0
3 0.98 514.5 294.0 110.25 7.0 0 0 0 1 0.0 0
4 0.90 563.5 318.5 122.50 7.0 1 0 0 0 0.0 0

Before We Start, We Split the Data into Train and Test: Why?
¶

No description has been provided for this image

  • How should the Train vs Test proportions be chosen?
  • Be careful when splitting the data into Train and Test: Why?
In [18]:
df1_train, df1_test, y1_train, y1_test = train_test_split(df1, y1, test_size=0.3, random_state=33)
df2A_train, df2A_test, y2_train, y2_test = train_test_split(df2A, y2, test_size=0.3, random_state=33) #No One-Hot
df2B_train, df2B_test, y2_train, y2_test = train_test_split(df2B, y2, test_size=0.3, random_state=33) # One-Hot
print(df1_train.shape, df1_test.shape)
print(df2A_train.shape, df2A_test.shape)
(104, 4) (45, 4)
(537, 8) (231, 8)

From Sample to Population: Underfitting and Overfitting
¶

No description has been provided for this image

Parsimony: Simple is Best
¶

k-Nearest Neighbour
¶

  • The simplest classifier, but it can also be used for regression (and even clustering).
  • Often called an instance-based learner
  • It has no "equation"; the approach is algorithmic, based on the concept of distance/similarity
  • Similar in spirit to DBSCAN

k-NN Neighbour Size & Weights
¶

  • Uniform: all points in each neighborhood are weighted equally.
  • Distance: closer neighbors of a query point have a greater influence than the neighbors further away.
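
A minimal sketch of when the two weighting schemes disagree, using a made-up one-dimensional toy set: with uniform weights the two farther "B" points outvote the single very close "A" point, while distance weighting lets the close neighbour win.

```python
# Sketch: the effect of 'uniform' vs 'distance' weights in k-NN (toy data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [2.0], [2.1]])
y = np.array(["A", "B", "B"])
query = np.array([[0.1]])  # very close to the lone "A" point

uni = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
dist = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
print(uni.predict(query), dist.predict(query))  # ['B'] ['A']
```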

Similarity VS Distance
¶

Similarity explained in plain terms and its application in Python¶

http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/¶
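
A minimal sketch of the contrast between distance and similarity (the vectors are made up for illustration): Euclidean distance measures how far apart two points are, while cosine similarity measures the angle between them and is scale-invariant.

```python
# Euclidean distance vs cosine similarity for the same pair of vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, different magnitude

euclidean = np.linalg.norm(a - b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean)  # ~3.74 : far apart as points
print(cosine)     #  1.0  : perfectly similar in direction
```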

Strengths and Weaknesses
¶

Pros:
  • Relatively fast (efficient) for data that is not too large
  • Simple and easy to implement
  • Easy to modify: many different distance/similarity formulas are available
  • Handles multiclass data easily
  • Accuracy is quite good when the data are representative
Cons:
  • Finding the nearest neighbours is inefficient for large datasets
  • The entire training data must be stored
  • Care is needed to choose an appropriate distance formula

Application in Python¶

In [19]:
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
n_neighbors = 3
weights = 'distance'
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(df1_train, y1_train)
print('Done!')
Done!
In [20]:
# Prediction with k-NN
y_kNN1 = kNN.predict(df1_test)
y_kNN1[-10:]
Out[20]:
array(['virginica', 'virginica', 'versicolor', 'versicolor', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'setosa', 'setosa'],
      dtype=object)
In [21]:
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(df2B_train, y2_train)
y_kNN2 = kNN.predict(df2B_test)
y_kNN2[-10:]
Out[21]:
array([13, 25, 29, 29, 29, 17, 10, 39, 10, 11], dtype=int64)

How Good Are These Predictions? Evaluation Metrics
¶

  • https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Confusion Matrix
¶

  • sensitivity, recall, hit rate, or true positive rate (TPR)
  • precision or positive predictive value (PPV)

* Which category is the "positive" one?

  • $0\leq F\leq 1$, with 1 the optimal value
  • $0\leq\beta<\infty$
  • $\beta < 1$ lends more weight to precision
  • $\beta > 1$ favours recall
  • $\beta \rightarrow 0$ considers only precision
  • $\beta \rightarrow \infty$ considers only recall
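
The effect of $\beta$ can be sketched with scikit-learn's fbeta_score on a made-up binary prediction where precision and recall differ:

```python
# How beta shifts the F-score between precision and recall.
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]  # precision = 2/3, recall = 2/4

p = precision_score(y_true, y_pred)             # ~0.667
r = recall_score(y_true, y_pred)                # 0.5
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # 0.625  -> pulled towards precision
f_two = fbeta_score(y_true, y_pred, beta=2)     # ~0.526 -> pulled towards recall
print(p, r, f_half, f_two)
```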

Micro VS Macro Metric
¶

For the two cases above:¶

  • Should we use micro or macro averaging?
  • Which is more important, precision or recall?
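
A sketch of the micro/macro difference on a hypothetical imbalanced example: macro averaging weights each class equally, while micro averaging pools all decisions, so the majority class dominates.

```python
# Micro vs macro recall on an imbalanced two-class toy example.
from sklearn.metrics import recall_score

y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # majority class perfect, minority class half missed

macro = recall_score(y_true, y_pred, average="macro")  # (1.0 + 0.5) / 2 = 0.75
micro = recall_score(y_true, y_pred, average="micro")  # 9 / 10 = 0.9
print(macro, micro)
```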
In [22]:
print("Kasus 01 - Bunga Iris: kNN")
print(confusion_matrix(y1_test, y_kNN1))
print(classification_report(y1_test, y_kNN1))
Kasus 01 - Bunga Iris: kNN
[[11  0  0]
 [ 0 15  0]
 [ 0  2 17]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       0.88      1.00      0.94        15
   virginica       1.00      0.89      0.94        19

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45

In [23]:
print("Kasus 02 - Building Energy")
print(confusion_matrix(y2_test, y_kNN2))
print(classification_report(y2_test, y_kNN2))
Kasus 02 - Building Energy
[[1 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 1 0 0]]
              precision    recall  f1-score   support

           6       1.00      0.33      0.50         3
           7       1.00      0.50      0.67         2
           8       0.00      0.00      0.00         1
          10       0.13      0.15      0.14        13
          11       0.15      0.14      0.15        14
          12       0.06      0.08      0.07        25
          13       0.06      0.12      0.08         8
          14       0.00      0.00      0.00        22
          15       0.22      0.17      0.19        12
          16       0.18      0.25      0.21         8
          17       0.00      0.00      0.00         2
          18       0.00      0.00      0.00         1
          19       0.00      0.00      0.00         5
          20       0.00      0.00      0.00         1
          21       0.00      0.00      0.00         1
          22       0.00      0.00      0.00         1
          23       0.00      0.00      0.00         6
          24       0.07      0.12      0.09         8
          25       0.00      0.00      0.00         2
          26       0.00      0.00      0.00         6
          27       0.00      0.00      0.00         1
          28       0.08      0.06      0.07        16
          29       0.00      0.00      0.00        11
          30       0.00      0.00      0.00         1
          31       0.00      0.00      0.00         3
          32       0.06      0.07      0.07        14
          33       0.00      0.00      0.00         5
          34       0.00      0.00      0.00         2
          35       0.25      0.14      0.18         7
          36       0.00      0.00      0.00        10
          37       0.00      0.00      0.00         3
          38       1.00      0.17      0.29         6
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         4
          41       0.33      0.33      0.33         3
          42       0.00      0.00      0.00         1

    accuracy                           0.08       231
   macro avg       0.13      0.07      0.08       231
weighted avg       0.11      0.08      0.09       231

Cross Validation
¶

  • The evaluation we have done so far is not yet sufficiently valid/objective ... Why?

In [24]:
# Cross validation
# Note the variables: we are now using the entire dataset,
# but preferably only the train data should be used (if the dataset is large enough)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
mulai = time.time()
scores_kNN = cross_val_score(kNN, df1, y1, cv=10)
waktu = time.time() - mulai
# 95% confidence interval for the accuracy
print("Accuracy k-NN: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_kNN.mean(), scores_kNN.std() * 2, waktu))
Accuracy k-NN: 0.97 (+/- 0.09), Waktu = 0.026 detik
In [25]:
# Visualization to evaluate the model more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN})
p = sns.boxplot(data = df_)
min(scores_kNN)
Out[25]:
0.8666666666666667
No description has been provided for this image

Classification with the Logistic Regression Model
¶

  • Find a straight line such that the prediction error is as small as possible (see the figure)
  • Logistic regression was originally a binary classification method: it distinguishes between 2 classes or categories.
  • Examples of binary classification problems: predicting whether a person has cancer or not, whether a tumour is benign or malignant, whether a (financial) transaction is fraudulent or not, negative/positive sentiment in sentiment analysis, etc.
  • Logistic regression is an extension of the linear regression model, converted into a classification problem.
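
The conversion from a linear model to a classifier can be sketched with the logistic (sigmoid) function, which maps the linear combination $w'x + b$ to a probability in (0, 1) that is then thresholded to pick a class:

```python
# The logistic (sigmoid) function at the heart of logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
print(sigmoid(z))  # approximately [0.018, 0.5, 0.982]
```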

Logistic Regression
¶

  • http://www.saedsayad.com/logistic_regression.htm
  • What is the meaning of the logarithm function?
  • What are the consequences of the formula for $\beta$ above?
  • Assumptions?

The Relationship of Logistic Regression to Neural Networks/Deep Learning
¶

Strengths and Weaknesses of Logistic Regression
¶

In [26]:
reglog = LogisticRegression().fit(df1_train, y1_train)
y_reglog1 = reglog.predict(df1_test)
print("Kasus 01 - Bunga Iris: Regresi Logistik")
print(confusion_matrix(y1_test, y_reglog1))
print(classification_report(y1_test, y_reglog1))
Kasus 01 - Bunga Iris: Regresi Logistik
[[11  0  0]
 [ 0 15  0]
 [ 0  3 16]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       0.83      1.00      0.91        15
   virginica       1.00      0.84      0.91        19

    accuracy                           0.93        45
   macro avg       0.94      0.95      0.94        45
weighted avg       0.94      0.93      0.93        45

In [27]:
mulai = time.time()
scores_regLog = cross_val_score(reglog, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Regresi Logistik: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_regLog.mean(), scores_regLog.std() * 2, waktu))
Accuracy Regresi Logistik: 0.97 (+/- 0.07), Waktu = 0.164 detik
In [28]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog})
p = sns.boxplot(data = df_)
df_.min()
Out[28]:
kNN       0.866667
RegLog    0.933333
dtype: float64
No description has been provided for this image

Naive Bayes Classifier
¶

  • P(x) is constant, so it can be ignored.
  • The strongest assumption is the independence of the predictor variables (hence the name "Naive")
  • Classification is done by computing the probability of each category given the data x = (x1, x2, ..., xm)
  • NBC variants differ in how P(c|x) is computed, for example with a Gaussian (Normal) distribution, often called Gaussian Naive Bayes (GNB):

  • Self readings:
  • https://www.saedsayad.com/naive_bayesian.htm
  • https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
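
As a sketch (not the notebook's own code), the GNB computation can be checked by hand on a hypothetical one-feature dataset: each class gets a Gaussian fitted to its feature values, and the class with the larger prior-times-likelihood wins, matching scikit-learn's GaussianNB.

```python
# Gaussian Naive Bayes by hand on one feature, checked against sklearn.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = 1.1
post = []
for c in (0, 1):
    xc = X[y == c].ravel()
    post.append(0.5 * gauss_pdf(x_new, xc.mean(), xc.var()))  # equal priors (0.5)
manual = int(np.argmax(post))

gnb = GaussianNB().fit(X, y)
print(manual, gnb.predict([[x_new]])[0])  # both predict class 0
```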

Strengths and Weaknesses of the Naive Bayes Classifier
¶

Pros:

  • Fast and easy to implement
  • Well suited to multiclass problems
  • If the independence assumption holds, performance is usually quite good and less (training) data is needed.
  • Usually works well for categorical predictors; for numeric predictors NBC assumes a normal distribution (which is sometimes not satisfied)

Cons:

  • If the test data contains a category that never appears in the training data, its estimated probability becomes 0. This is often called the "zero frequency" problem.
  • The independence assumption between predictors is very strong.
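
The "zero frequency" problem and its standard fix, Laplace (add-one) smoothing, can be sketched as follows; the category counts here are hypothetical. Without smoothing, an unseen category gets probability 0, which wipes out the entire product of probabilities in Naive Bayes.

```python
# Laplace (add-one) smoothing for a categorical predictor's probabilities.
counts = {"red": 3, "blue": 2}     # counts of one predictor's categories within a class
vocab = ["red", "blue", "green"]   # "green" never appears in the training data

def prob(cat, alpha=1):
    total = sum(counts.values()) + alpha * len(vocab)
    return (counts.get(cat, 0) + alpha) / total

print(prob("green", alpha=0))  # 0.0   -> kills the Naive Bayes product
print(prob("green", alpha=1))  # 0.125 -> smoothed, non-zero
```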

Naive Bayes in Social Media Analytics
¶

  • Sentiment Analysis

In [29]:
# Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html

gnb = GaussianNB()
nbc = gnb.fit(df1_train, y1_train)
y_nb1 = nbc.predict(df1_test)

print(confusion_matrix(y1_test, y_nb1))
print(classification_report(y1_test, y_nb1))
[[11  0  0]
 [ 0 15  0]
 [ 0  2 17]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       0.88      1.00      0.94        15
   virginica       1.00      0.89      0.94        19

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45

In [30]:
mulai = time.time()
scores_nb = cross_val_score(nbc, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Naive Bayes: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_nb.mean(), scores_nb.std() * 2, waktu))
Accuracy Naive Bayes: 0.95 (+/- 0.09), Waktu = 0.024 detik
In [31]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb})
p = sns.boxplot(data = df_)
df_.min()
Out[31]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
dtype: float64
No description has been provided for this image

Decision Tree: An Analogy
¶

Decision Tree
¶

Decision Tree: Example Applications
¶

Decision Tree Theory: Entropy Formula
¶

Decision Tree Theory: Entropy Calculation
¶

Decision Tree Theory: Gain Formula
¶

Decision Tree Theory: Gain Calculation
¶

  • Another example: http://www.saedsayad.com/decision_tree.htm
  • Ross Quinlan's website: https://www.rulequest.com/Personal/

Decision Tree Theory: Information Theory
¶

  • Alternative to Information Gain : Gini Index (CART): https://medium.com/deep-math-machine-learning-ai/chapter-4-decision-trees-algorithms-b93975f7a1f1
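
The entropy and gain formulas above can be sketched numerically (the toy labels are made up for illustration): entropy is $H(S) = -\sum_i p_i \log_2 p_i$, and the gain of a split is the parent's entropy minus the weighted entropy of the children.

```python
# Entropy and information gain computed by hand for a binary split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 4 vs 4 -> entropy 1.0 bit
left, right = parent[:4], parent[4:]         # a perfectly separating split
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print(entropy(parent), gain)  # 1.0 1.0
```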

The Effect of Tree Depth on the Shape of the Model
¶

Decision Tree: Strengths & Weaknesses
¶

When to use:

  • Target: binomial/nominal.
  • Predictors (input): binomial, nominal, and/or interval (ratio).

Advantages:

  • Fast and embarrassingly parallel.
  • No iterations; well suited to Big Data technology (e.g. Hadoop) [map-reduce friendly]
  • Interpretability
  • Robust to outliers & missing values

Disadvantages:

  • Non-probabilistic (ad hoc heuristic) +/-
  • Struggles when the target has many classes
  • Sensitive (instability)
In [32]:
# Decision Tree: http://scikit-learn.org/stable/modules/tree.html
DT = tree.DecisionTreeClassifier() 
# Deliberately using default parameters; (hyper)parameter optimization will be covered later
DT = DT.fit(df1_train, y1_train)
y_DT1 = DT.predict(df1_test)

print(confusion_matrix(y1_test, y_DT1))
print(classification_report(y1_test, y_DT1))
[[11  0  0]
 [ 0 15  0]
 [ 0  4 15]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       0.79      1.00      0.88        15
   virginica       1.00      0.79      0.88        19

    accuracy                           0.91        45
   macro avg       0.93      0.93      0.92        45
weighted avg       0.93      0.91      0.91        45

In [33]:
# Variable importance - one of the strengths of Decision Trees
DT.feature_importances_
Out[33]:
array([0.01933984, 0.01450488, 0.06045018, 0.90570509])
In [40]:
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(df1_train, y1_train)
p = tree.plot_tree(clf)
No description has been provided for this image
In [41]:
mulai = time.time()
scores_dt = cross_val_score(DT, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Decision Tree: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_dt.mean(), scores_dt.std() * 2, waktu))
Accuracy Decision Tree: 0.95 (+/- 0.09), Waktu = 0.024 detik
In [42]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb, "DecTree":scores_dt})
p = sns.boxplot(data = df_)
df_.min()
Out[42]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
dtype: float64
No description has been provided for this image

Curse of Dimensionality
¶

Curse of Dimensionality & Random Forest
¶

In [43]:
# Let's try to improve things with a Random Forest
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rf = RandomForestClassifier()
rf.fit(df1_train, y1_train)
y_rf1 = rf.predict(df1_test)

print(confusion_matrix(y1_test, y_rf1))
print(classification_report(y1_test, y_rf1))
[[11  0  0]
 [ 0 15  0]
 [ 0  3 16]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       0.83      1.00      0.91        15
   virginica       1.00      0.84      0.91        19

    accuracy                           0.93        45
   macro avg       0.94      0.95      0.94        45
weighted avg       0.94      0.93      0.93        45

In [44]:
# Variable importance
importances = rf.feature_importances_
std = np.std([t.feature_importances_ for t in rf.estimators_], axis=0)  # 't', to avoid reusing the name of the imported 'tree' module
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(df1.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(df1.shape[1]), importances[indices],color="r", yerr=std[indices], align="center")
plt.xticks(range(df1.shape[1]), indices)
plt.xlim([-1, df1.shape[1]])
plt.show()
Feature ranking:
1. feature 3 (0.458596)
2. feature 2 (0.400527)
3. feature 0 (0.109768)
4. feature 1 (0.031108)
No description has been provided for this image
In [45]:
mulai = time.time()
scores_rf = cross_val_score(rf, df1, y1, cv=10) # note that we now use the entire dataset
waktu = time.time() - mulai
print("Accuracy Random Forest: %0.2f (+/- %0.2f), Waktu = %0.3f detik" % (scores_rf.mean(), scores_rf.std() * 2, waktu))
Accuracy Random Forest: 0.96 (+/- 0.09), Waktu = 0.825 detik
In [46]:
# Visualization to evaluate & compare the models more thoroughly
df_ = pd.DataFrame({'kNN': scores_kNN, 'RegLog': scores_regLog, 'NaiveBys':scores_nb, "DecTree":scores_dt, "Forest": scores_rf})
p = sns.boxplot(data = df_)
df_.min()
Out[46]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
Forest      0.866667
dtype: float64
No description has been provided for this image

A More Complex Model Is Not Necessarily Better. Why?
¶

In [47]:
# Saving results for use in the next module
import pickle

f = open('data/data_Module-11.pckl', 'wb')
pickle.dump((df_, df1, y1, df2A, df2B, y2), f)
f.close()
"Done"
Out[47]:
'Done'

End - Introduction to Classification Models
¶


Part 02: Advanced Classification Models
¶

  • Support Vector Machines
  • Evaluation revisited: Underfitting & Overfitting
  • Pipelining & Parameter Optimization
  • Proper Model Selection
  • Ensemble Learning
  • Imbalanced Learning
  • Case Study

No description has been provided for this image

In [48]:
# Loading Modules
import warnings; warnings.simplefilter('ignore')
import pickle
import pandas as pd, matplotlib.pyplot as plt
import time, numpy as np, seaborn as sns
from sklearn import svm, preprocessing
from sklearn import  tree
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline 
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter
from tqdm import tqdm
sns.set(style="ticks", color_codes=True)
print(pd.__version__)
"Done"
1.5.1
Out[48]:
'Done'
In [49]:
# Start by loading the data from the previous module
file_ = "data/data_Module-11.pckl"
try: # Running locally; make sure "file_" is in the "data" folder
    f = open(file_, 'rb')
    data = pickle.load(f); f.close()
except: # Running in Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
    f = open(file_, 'rb')
    data = pickle.load(f); f.close()

df_, df1, y1, df2A, df2B, y2 = data
df_.shape, df_.keys()
Out[49]:
((10, 5),
 Index(['kNN', 'RegLog', 'NaiveBys', 'DecTree', 'Forest'], dtype='object'))
In [50]:
# Will be the same as in the previous module because the SEED value is the same.
df1_train, df1_test, y1_train, y1_test = train_test_split(df1, y1, test_size=0.3, random_state=33)
df2A_train, df2A_test, y2_train, y2_test = train_test_split(df2A, y2, test_size=0.3, random_state=33) #No One-Hot
df2B_train, df2B_test, y2_train, y2_test = train_test_split(df2B, y2, test_size=0.3, random_state=33) # One-Hot
"Done"
Out[50]:
'Done'

Support Vector Machine (SVM)
¶

Suppose the data are written as $\{(\bar{x}_1,y_1),...,(\bar{x}_n,y_n)\}$, where $\bar{x}_i$ is the input pattern of the $i$-th observation and $y_i$ is the desired target value. The categories (classes) are represented by $y_i \in \{-1,1\}$. A hyperplane separating these two classes (the "linearly separable" case) is: $$ \bar{w}'\bar{x} + b=0 $$ where $\bar{x}$ is the input vector (predictor), $\bar{w}$ the weight vector, and $b$ the bias.
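As a small illustrative sketch (the values of $\bar{w}$ and $b$ below are made up, not fitted), the decision rule implied by this hyperplane is simply the sign of $\bar{w}'\bar{x} + b$:

```python
import numpy as np

# Hypothetical weight vector and bias for a 2-dimensional toy problem
w = np.array([1.0, -1.0])
b = -0.5
points = np.array([[2.0, 0.5],   # lies on the w'x + b > 0 side
                   [0.0, 2.0]])  # lies on the w'x + b < 0 side

# Class = sign of the (signed) distance to the hyperplane
labels = np.sign(points @ w + b).astype(int)
print(labels)  # [ 1 -1]
```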

Advantages of SVM Modeling
¶

Support Vector Machine: Soft Margin
¶

  • Solved "easily" via linear/quadratic programming.
  • The objective function is **convex**, so solving it yields the global optimum.
  • Interpretation: the Recursive Feature Elimination (RFE) method looks at the squared value of each component of w (higher is better).
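A hedged sketch of the RFE idea above, using scikit-learn's `RFE` with a linear SVC on the iris data (the dataset choice here is just for illustration): features with the smallest squared weights are dropped iteratively.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# RFE with a linear SVM: repeatedly eliminates the feature whose
# squared weight w_i^2 is smallest, until 2 features remain
rfe = RFE(SVC(kernel="linear"), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 2 features kept
print(rfe.ranking_)   # rank 1 = selected; higher rank = eliminated earlier
```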

SVM Kernel (trick): $R^m \rightarrow R^n, n\geq m$
¶

Example Kernel Function
¶

  • Let x = (x1, x2, x3); y = (y1, y2, y3),
  • with the feature mapping f(x) = (x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²);
  • then the kernel is K(x, y) = <f(x), f(y)> = <x, y>².
  • Numerical example: let x = (1, 2, 3) and y = (4, 5, 6). Then:
  • f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
    f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
  • <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
  • Complicated!... Using the kernel function, the computation simplifies to:
  • K(x, y) = (4 + 10 + 18)² = 32² = 1024
  • This means a computation in the high-dimensional feature space can be done in the original (low-dimensional) space via the inner product!
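A quick NumPy sanity check of the numbers above: the explicit 9-dimensional inner product and the kernel shortcut give the same value.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Explicit feature map f(v): all pairwise products v_i * v_j (9 dimensions)
def f(v):
    return np.outer(v, v).ravel()

lhs = f(x) @ f(y)   # inner product in the 9-dimensional feature space
rhs = (x @ y) ** 2  # kernel trick: the same value computed in 3 dimensions
print(lhs, rhs)     # 1024 1024
```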

Popular Kernel Functions
¶

Strengths and Weaknesses of SVM
¶

Pros

  • Good accuracy
  • Works well on relatively small data samples
  • Depends only on the support vectors ==> improves efficiency
  • Convex ==> global minimum ==> guaranteed convergence

Cons

  • Inefficient for large datasets
  • Accuracy is sometimes low for multiclass problems (it is hard to capture relationships between categories in the model)
  • Not robust to noise

Further reading:

  • https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72
  • Worked manual example: https://slideplayer.info/slide/3672979/?fbclid=IwAR3Tteg_PbKwkBxV63FGfat3o9UBfHBnjvGHwlyYcrxKTWeb6gfsSpBAQBE
In [51]:
# Fit and evaluate the model
dSVM = svm.SVC(C = 10**5, kernel = 'linear') # e.g. using a linear kernel

dSVM.fit(df1_train, y1_train)
y_SVM1 = dSVM.predict(df1_test)

print(confusion_matrix(y1_test, y_SVM1))
print(classification_report(y1_test, y_SVM1))
[[11  0  0]
 [ 0 15  0]
 [ 0  0 19]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        11
  versicolor       1.00      1.00      1.00        15
   virginica       1.00      1.00      1.00        19

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

In [52]:
# The support vectors
print('Support vector indices: ', dSVM.support_)
print('Support vector data: \n', dSVM.support_vectors_)
Support vector indices:  [14 41 83  0 12 80 93 15 18 31 42 43 88]
Support vector data: 
 [[4.5 2.3 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.1 3.3 1.7 0.5]
 [5.1 2.5 3.  1.1]
 [5.9 3.2 4.8 1.8]
 [6.  2.7 5.1 1.6]
 [6.7 3.  5.  1.7]
 [6.3 2.8 5.1 1.5]
 [7.2 3.  5.8 1.6]
 [6.1 3.  4.9 1.8]
 [6.5 3.2 5.1 2. ]
 [6.3 2.7 4.9 1.8]
 [4.9 2.5 4.5 1.7]]
In [53]:
# Model weights for interpretation
print('w = ',dSVM.coef_)
print('b = ',dSVM.intercept_)
w =  [[-0.04630589  0.52106895 -1.00301941 -0.46411937]
 [ 0.04017805  0.17410509 -0.55713561 -0.2437469 ]
 [ 3.71728259  3.70419407 -7.34998017 -8.65277018]]
b =  [ 1.45332688  1.28948112 17.22405189]
In [54]:
# Using kernels: http://scikit-learn.org/stable/modules/svm.html#svm-kernels
for kernel in ('sigmoid', 'poly', 'rbf', 'linear'):
    dSVM = svm.SVC(kernel=kernel)
    dSVM.fit(df1_train, y1_train)
    y_SVM = dSVM.predict(df1_test)
    print(accuracy_score(y1_test, y_SVM))
0.24444444444444444
0.9777777777777777
0.9333333333333333
0.9555555555555556
In [55]:
dSVM = svm.SVC(C = 10**5, kernel = 'linear')
mulai = time.time()
scores_svm = cross_val_score(dSVM, df1, y1, cv=10) # note: now we use the whole dataset
waktu = time.time() - mulai
print("Accuracy SVM: %0.2f (+/- %0.2f), time = %0.3f s" % (scores_svm.mean(), scores_svm.std() * 2, waktu))
Accuracy SVM: 0.98 (+/- 0.09), time = 0.094 s
In [56]:
# Visualization to better evaluate & compare the models
df_['SVM'] = scores_svm
p = sns.boxplot(data = df_)
df_.min()
Out[56]:
kNN         0.866667
RegLog      0.933333
NaiveBys    0.866667
DecTree     0.866667
Forest      0.866667
SVM         0.866667
dtype: float64

Inductive Bias
¶

  • Bias of (statistical) parameter estimation
  • Inductive bias of the sample (Machine Learning - Tom Mitchell)
  • Inductive bias of classifier selection (Statistical Learning Theory - Vapnik)

(Hyper)Parameter Optimization
¶

  • The comparison we just did, although cross-validated, is not yet fully valid.
  • When comparing models, we must make sure that every model gets its optimal parameters.

In [57]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'

try:
    # Local jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names)
except:
    # it's a google colab... create folder data and then download the file from github
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file}
    data = pd.read_csv(file, names=names)
    
print(data.shape, set(data['class']))
data.sample(5)
(768, 9) {0, 1}
Out[57]:
preg plas pres skin test mass pedi age class
424 8 151 78 32 210 42.9 0.516 36 1
77 5 95 72 33 0 37.7 0.370 27 0
322 0 124 70 20 0 27.4 0.254 36 1
443 8 108 70 0 0 30.5 0.955 33 1
699 4 118 70 0 0 44.5 0.904 26 0
In [58]:
# Split Train-Test

X = data.values[:,:8]  # Slice data (note: the data structure here is a NumPy array)
Y = data.values[:,8]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=99)

print(set(Y), x_train.shape, x_test.shape, sep=', ')
{0.0, 1.0}, (614, 8), (154, 8)

First, Run with the Default Parameters
¶

In [59]:
clf = LogisticRegression(solver='liblinear')
kNN = neighbors.KNeighborsClassifier()
gnb = GaussianNB()
dt = tree.DecisionTreeClassifier()
rf = RandomForestClassifier()
svm_ = svm.SVC()

Models = [('Regresi Logistik', clf), ('k-NN',kNN), ('Naive Bayes',gnb), ('Decision Tree', dt), ('Random Forest', rf), ('SVM', svm_)]
Scores = {}
for model_name, model in tqdm(Models):
    Scores[model_name] = cross_val_score(model, x_train, y_train, cv=10, scoring='accuracy')

fig, ax = plt.subplots(1, 1, figsize=(10, 8))
dfScores = pd.DataFrame.from_dict(Scores)  # avoid reusing the name "dt" (the decision tree above)
ax = sns.boxplot(data=dfScores, ax=ax)
for m, s in Scores.items():
    print(m, list(s)[:4])
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:01<00:00,  4.84it/s]
Regresi Logistik [0.6290322580645161, 0.8225806451612904, 0.7741935483870968, 0.7096774193548387]
k-NN [0.6935483870967742, 0.6935483870967742, 0.7258064516129032, 0.5967741935483871]
Naive Bayes [0.7096774193548387, 0.8387096774193549, 0.6774193548387096, 0.7580645161290323]
Decision Tree [0.6935483870967742, 0.7419354838709677, 0.6935483870967742, 0.7419354838709677]
Random Forest [0.7096774193548387, 0.8064516129032258, 0.8064516129032258, 0.7741935483870968]
SVM [0.7096774193548387, 0.7903225806451613, 0.6935483870967742, 0.7258064516129032]


Hyperparameter Optimization
¶

  • As examples, we tune two models covered earlier: k-NN and SVM.
  • As an exercise, try hyperparameter optimization (HO) on the other models.
  • Each ML model has different parameters, and the optimal values differ from case to case.

In [60]:
# Hyperparameter optimization of the kNN model using GridSearchCV
kCV = 10
metric = 'accuracy'
params = {}
params['kneighborsclassifier__n_neighbors'] = [1, 3, 5, 10, 15, 20, 25, 30]
params['kneighborsclassifier__weights'] = ('distance', 'uniform')

pipe = make_pipeline(neighbors.KNeighborsClassifier())
optKnn = GridSearchCV(pipe, params, cv=kCV, scoring=metric, verbose=1, n_jobs=-2) #
optKnn.fit(x_train, y_train)
print(optKnn.best_score_)
print(optKnn.best_params_)
Fitting 10 folds for each of 16 candidates, totalling 160 fits
0.7297726070861978
{'kneighborsclassifier__n_neighbors': 20, 'kneighborsclassifier__weights': 'uniform'}
In [61]:
# Example: hyperparameter optimization of the SVM model using RandomizedSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
# The following shows how to find out which parameters can be optimized.
# Use theoretical/analytical knowledge to optimize only the most important parameters.
pipeSVM = make_pipeline(svm.SVC())
print(sorted(pipeSVM.get_params().keys()))
['memory', 'steps', 'svc', 'svc__C', 'svc__break_ties', 'svc__cache_size', 'svc__class_weight', 'svc__coef0', 'svc__decision_function_shape', 'svc__degree', 'svc__gamma', 'svc__kernel', 'svc__max_iter', 'svc__probability', 'svc__random_state', 'svc__shrinking', 'svc__tol', 'svc__verbose', 'verbose']
In [62]:
# Optimal SVM parameters via RandomizedSearch
# WARNING: this cell takes quite a long time to run
kCV = 10
paramsSVM = {}
paramsSVM['svc__C'] = [1, 10, 100, 1000] #sp.stats.uniform(scale=100)
paramsSVM['svc__gamma'] = [0.1, 0.001, 0.0001, 1, 10]
paramsSVM['svc__kernel'] = ['rbf', 'sigmoid', 'linear'] # , 'poly'
optSvm = RandomizedSearchCV(pipeSVM, paramsSVM, cv=kCV, scoring=metric, verbose=2, n_jobs=-2) # refit=True, pre_dispatch='2*n_jobs' pre_dispatch min 2* n_jobs
optSvm.fit(x_train, y_train)
print(optSvm.best_score_)
print(optSvm.best_params_)
Fitting 10 folds for each of 10 candidates, totalling 100 fits
0.7736118455843469
{'svc__kernel': 'linear', 'svc__gamma': 1, 'svc__C': 1000}

Model Selection
¶

In [63]:
kCV = 10
# Using the optimal parameters
kNN = neighbors.KNeighborsClassifier(n_neighbors= 20, weights= 'uniform')
svm_ = svm.SVC(kernel= 'linear', gamma= 10, C= 10) # note: gamma is ignored by the linear kernel

# Cross-validation
models = ['kNN', 'SVM']
knn_score = cross_val_score(kNN, x_test, y_test, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
svm_score = cross_val_score(svm_, x_test, y_test, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
scores = [knn_score, svm_score]

data = {m:s for m,s in zip(models, scores)}
for name in data.keys():
    print("Accuracy %s: %0.2f (+/- %0.2f)" % (name, data[name].mean(), data[name].std() * 2))

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
p = sns.boxplot(data=pd.DataFrame(data), ax=ax)
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 15 concurrent workers.
[Parallel(n_jobs=-2)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-2)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 15 concurrent workers.
[Parallel(n_jobs=-2)]: Done   3 out of  10 | elapsed:    1.2s remaining:    2.9s
Accuracy kNN: 0.71 (+/- 0.17)
Accuracy SVM: 0.78 (+/- 0.20)
[Parallel(n_jobs=-2)]: Done  10 out of  10 | elapsed:    7.0s finished

Ensemble Model
¶

  • What? Learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
  • Why? Better predictions and more stable models.
  • How? Bagging & Boosting.

“Meta-algorithms”: Bagging & Boosting
¶

  • Ensemble https://www.youtube.com/watch?v=Un9zObFjBH0
  • Bagging https://www.youtube.com/watch?v=2Mg8QD0F1dQ
  • Boosting https://www.youtube.com/watch?v=GM3CDQfQ4sw

Boosting in ML
¶

Properties of Boosting
¶

AdaBoost
¶

  • https://youtu.be/BoGNyWW9-mE?t=70
In [64]:
# Example of voting (bagging) in Python
# Note: Random Forest is a bagging ensemble (albeit modified)
# Best practice: every model in the ensemble uses its optimal parameters

kNN = neighbors.KNeighborsClassifier(3)
kNN.fit(x_train, y_train)
Y_kNN = kNN.score(x_test, y_test)

DT = tree.DecisionTreeClassifier(random_state=1)
DT.fit(x_train, y_train)
Y_DT = DT.score(x_test, y_test)

model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(x_train,y_train)
Y_Vot = model.score(x_test,y_test)

print('k-NN accuracy', Y_kNN)
print('Decision Tree accuracy', Y_DT)
print('Voting accuracy', Y_Vot)
k-NN accuracy 0.7142857142857143
Decision Tree accuracy 0.6818181818181818
Voting accuracy 0.7337662337662337
In [65]:
# Averaging can also be used for classification (not just regression),
# but then we use the probability of each category
T = tree.DecisionTreeClassifier()
K = neighbors.KNeighborsClassifier()
R = LogisticRegression()

T.fit(x_train,y_train)
K.fit(x_train,y_train)
R.fit(x_train,y_train)

y_T=T.predict_proba(x_test)
y_K=K.predict_proba(x_test)
y_R=R.predict_proba(x_test)

Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Averaging accuracy', accuracy_score(y_test, prediction))
[[0.86747806 0.13252194]
 [0.96569617 0.03430383]
 [0.90409318 0.09590682]
 [0.81735063 0.18264937]
 [0.97683156 0.02316844]]
[0, 0, 0, 0, 0]
Averaging accuracy 0.7467532467532467
In [66]:
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=33)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7421565276828435

Imbalanced Data
¶

  • The metric trap
  • Accuracy on certain categories matters more
  • Example cases
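To make the "metric trap" concrete, here is a small sketch with made-up labels: a classifier that always predicts the majority class scores high accuracy while being useless on the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up, heavily imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # "model" that always predicts class 0

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- minority class ignored
```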

Imbalanced Learning
¶

  • Undersampling, oversampling, model-based (weight adjustment)
  • https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
  • Comparison plots: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py
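A minimal random-oversampling sketch using only scikit-learn's `resample` (the toy data below is made up; libraries such as imbalanced-learn offer more sophisticated methods like SMOTE):

```python
import numpy as np
from sklearn.utils import resample

# Made-up toy data: 10 majority (class 0) vs 3 minority (class 1) samples
X = np.arange(13).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 3)

# Random oversampling: resample the minority class with replacement
X_min, y_min = X[y == 1], y[y == 1]
X_over, y_over = resample(X_min, y_min, replace=True, n_samples=10, random_state=33)

# Recombine into a balanced dataset
X_bal = np.vstack([X[y == 0], X_over])
y_bal = np.concatenate([y[y == 0], y_over])
print(np.bincount(y_bal))  # [10 10] -> balanced
```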
In [67]:
Counter(Y)
Out[67]:
Counter({1.0: 268, 0.0: 500})
In [68]:
# fit a plain (unweighted) SVM as a baseline

svm_ = svm.SVC(kernel='linear')
svm_.fit(x_train, y_train)
y_SVMib = svm_.predict(x_test)

print(confusion_matrix(y_test, y_SVMib))
print(classification_report(y_test, y_SVMib))
[[93 12]
 [19 30]]
              precision    recall  f1-score   support

         0.0       0.83      0.89      0.86       105
         1.0       0.71      0.61      0.66        49

    accuracy                           0.80       154
   macro avg       0.77      0.75      0.76       154
weighted avg       0.79      0.80      0.79       154

In [69]:
# fit the model and get the separating hyperplane using weighted classes
# x_train, x_test, y_train, y_test

svm_balanced = svm.SVC(kernel='linear', class_weight={1: 3}) #WEIGHTED SVM
svm_balanced.fit(x_train, y_train)
y_SVMb = svm_balanced.predict(x_test)

print(confusion_matrix(y_test, y_SVMb))
print(classification_report(y_test, y_SVMb))
[[67 38]
 [ 7 42]]
              precision    recall  f1-score   support

         0.0       0.91      0.64      0.75       105
         1.0       0.53      0.86      0.65        49

    accuracy                           0.71       154
   macro avg       0.72      0.75      0.70       154
weighted avg       0.78      0.71      0.72       154

In [70]:
# Example of model-based imbalance treatment - SVM
from sklearn.datasets import make_blobs
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],centers=centers,cluster_std=clusters_std,random_state=33, shuffle=False)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10}) #WEIGHTED SVM
wclf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')# plot the samples
ax = plt.gca()# plot the decision functions for both classifiers
xlim = ax.get_xlim(); ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)# create grid to evaluate model
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-']) # plot decision boundary and margins
Z = wclf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane for weighted classes
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])# plot decision boundary and margins for weighted classes
plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"], loc="upper right")
plt.show()

Weighted Decision Tree
¶

In [71]:
T = tree.DecisionTreeClassifier(random_state = 33)
T.fit(x_train,y_train)
y_DT = T.predict(x_test)
print('Accuracy (plain decision tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))

T = tree.DecisionTreeClassifier(class_weight = 'balanced', random_state = 33)
T.fit(x_train, y_train)
y_DT = T.predict(x_test)
print('Accuracy (weighted decision tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
Accuracy (plain decision tree) =  0.6883116883116883
              precision    recall  f1-score   support

         0.0       0.79      0.73      0.76       105
         1.0       0.51      0.59      0.55        49

    accuracy                           0.69       154
   macro avg       0.65      0.66      0.65       154
weighted avg       0.70      0.69      0.69       154

Accuracy (weighted decision tree) =  0.7207792207792207
              precision    recall  f1-score   support

         0.0       0.83      0.74      0.78       105
         1.0       0.55      0.67      0.61        49

    accuracy                           0.72       154
   macro avg       0.69      0.71      0.69       154
weighted avg       0.74      0.72      0.73       154

Case Study (Exercise) ENB2012: Predicting Building Energy Usage
¶

Tasks

  • Filter the EcoTest data and keep only the target-variable categories that appear at least 10 times (heat-cat).
  • Perform EDA (preprocessing and basic visualization).
  • Determine the best model (with optimal parameters and cross-validation).
  • Be careful: Naive Bayes, Decision Tree, and Random Forest do not require one-hot encoding.
  • Use the micro F1-score metric to determine the best model.

Optional

  • Compare the best model above with an ensemble model.
  • Is there an imbalance problem? Try addressing it with over-/under-sampling.
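For the first task, a possible filtering sketch (shown on a made-up stand-in for the `heat-cat` column, not the real ENB2012 data):

```python
import pandas as pd

# Toy stand-in for the target column: category 6 appears only 3 times
df = pd.DataFrame({"heat-cat": [15] * 12 + [29] * 11 + [6] * 3})

counts = df["heat-cat"].value_counts()
keep = counts[counts >= 10].index           # categories appearing at least 10 times
filtered = df[df["heat-cat"].isin(keep)]

print(sorted(filtered["heat-cat"].unique()))  # [15, 29]
print(len(filtered))                          # 23
```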
In [72]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"

try: # Running locally; make sure "file_" is in the "data" folder
    data = pd.read_csv(file_)
except: # Running in Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/{file_}
    data = pd.read_csv(file_)
print(data.shape)
data.sample(5)
(768, 12)
Out[72]:
compactness surface-area wall-area roof-area overall-height orientation glazing-area glazing-dist heating-load cooling-load heat-cat cool-cat
488 0.86 588.0 294.0 147.0 7.0 2 0.25 5 29.71 28.02 29 28
29 0.71 710.5 269.5 220.5 3.5 3 0.00 0 6.40 11.72 6 11
278 0.66 759.5 318.5 220.5 3.5 4 0.10 5 11.22 14.65 11 14
399 0.82 612.5 318.5 147.0 7.0 5 0.25 3 25.17 26.41 25 26
7 0.90 563.5 318.5 122.5 7.0 5 0.00 0 19.68 29.60 19 29
In [73]:
# Start your exercise solution in this cell

End of the Advanced Classification Models Module
¶